Potential-Based Shaping and Q-Value Initialization are Equivalent
Author
Abstract
Shaping has proven to be a powerful but precarious means of improving reinforcement learning performance. Ng, Harada, and Russell (1999) proposed the potential-based shaping algorithm for adding shaping rewards in a way that guarantees the learner will learn optimal behavior. In this note, we prove certain similarities between this shaping algorithm and the initialization step required for several reinforcement learning algorithms. More specifically, we prove that a reinforcement learner whose initial Q-values are based on the shaping algorithm's potential function makes the same updates throughout learning as a learner receiving potential-based shaping rewards. We further prove that under a broad category of policies, the behavior of these two learners is indistinguishable. The comparison provides intuition on the theoretical properties of the shaping algorithm as well as a suggestion for a simpler method for capturing the algorithm's benefit. In addition, the equivalence raises previously unaddressed issues concerning the efficiency of learning with potential-based shaping.

1. Potential-Based Shaping

Shaping is a common technique for improving learning performance in reinforcement learning tasks. The idea of shaping is to provide the learner with supplemental rewards that encourage progress towards highly rewarding states in the environment. If these shaping rewards are applied arbitrarily, they run the risk of distracting the learner from the intended goals in the environment. In this case, the learner converges on a policy that is optimal in the presence of the shaping rewards, but suboptimal in terms of the original task. Ng, Harada, and Russell (1999) proposed a method for adding shaping rewards in a way that guarantees the optimal policy maintains its optimality. They model a reinforcement learning task as a Markov Decision Process (MDP), where the learner tries to find a policy that maximizes discounted future reward (Sutton & Barto, 1998). They define a potential function Φ(·) over the states. The shaping reward for transitioning from state s to s′ is defined in terms of Φ as: F(s, s′) = γΦ(s′) − Φ(s), where γ is the MDP's discount rate. This shaping reward is added to the environmental reward for every state transition the learner experiences. The potential function can be viewed as defining a topography over the state space. The shaping reward for transitioning from one state to another is therefore the discounted change in this state potential. Potential-based shaping guarantees that no cycle through a sequence of states yields a net gain in shaping reward.
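To make the claimed equivalence concrete, the following sketch (not from the paper) runs tabular Q-learning on a small, assumed 5-state chain MDP and compares two learners driven by the same random experience: one receives the shaping reward F(s, s′) = γΦ(s′) − Φ(s) with a zero-initialized Q-table, the other receives no shaping but has its Q-table initialized to Φ(s) for every action. The chain dynamics, the potential Phi, and the constants gamma and alpha are illustrative assumptions, not values from the paper; the final check shows the two Q-tables differ exactly by Φ at every step, so their updates (and greedy action choices) coincide.

    import numpy as np

    # Minimal sketch: a 5-state chain MDP (assumed for illustration).
    # Actions: 0 = move left, 1 = move right; reaching the last state pays +1.
    n_states, n_actions = 5, 2
    gamma, alpha = 0.9, 0.1
    Phi = np.array([0.0, 0.2, 0.4, 0.6, 0.8])   # assumed potential over states

    def step(s, a):
        """Deterministic chain dynamics with a goal reward at the right end."""
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s2 == n_states - 1 else 0.0
        return s2, r

    # Learner A: Q initialized to zero, receives the shaping reward F.
    Q_shape = np.zeros((n_states, n_actions))
    # Learner B: no shaping, but Q initialized to the potential of each state.
    Q_init = np.tile(Phi[:, None], (1, n_actions))

    rng = np.random.default_rng(0)
    s = 0
    for _ in range(1000):
        a = rng.integers(n_actions)          # same experience stream for both learners
        s2, r = step(s, a)
        F = gamma * Phi[s2] - Phi[s]         # potential-based shaping reward

        # Standard Q-learning updates on the same transition.
        Q_shape[s, a] += alpha * (r + F + gamma * Q_shape[s2].max() - Q_shape[s, a])
        Q_init[s, a]  += alpha * (r     + gamma * Q_init[s2].max()  - Q_init[s, a])

        s = 0 if s2 == n_states - 1 else s2  # restart at the left end after the goal

    # The two value tables differ exactly by the potential, so every update
    # and every greedy action choice coincides between the two learners.
    print(np.allclose(Q_shape + Phi[:, None], Q_init))   # expected: True

The uniform-random behavior policy is used only to guarantee both learners see identical transitions; the paper's stronger statement is that the equivalence also holds for any policy that depends on Q-values only through state-wise differences (e.g., greedy or softmax), since adding Φ(s) to every action in state s leaves those differences unchanged.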
Similar articles
Theoretical considerations of potential-based reward shaping for multi-agent systems
Potential-based reward shaping has previously been proven to both be equivalent to Q-table initialisation and guarantee policy invariance in single-agent reinforcement learning. The method has since been used in multi-agent reinforcement learning without consideration of whether the theoretical equivalence and guarantees hold. This paper extends the existing proofs to similar results in multi-a...
Fuzzy transferable-utility games: a weighted allocation and related results
By considering the supreme-utilities among fuzzy sets and the weights among participants simultaneously, we introduce the supreme-weighted value on fuzzy transferable-utility games. Further, we provide some equivalent relations to characterize the family of all solutions that admit a potential on weights. We also propose the dividend approach to provide alternative viewpoint for the potential a...
On the Distribution and Moments of Record Values in Increasing Populations
Consider a sequence of n independent observations from a population of increasing size αi, i = 1,2,... and an absolutely continuous initial distribution function. The distribution of the kth record value is represented as a countable mixture, with mixing the distribution of the kth record time and mixed the distribution of the nth order statistic. Precisely, the distribution function and (pow...
An Optimal Selection of Induction Heating Capacitance by Genetic Algorithm Considering Dissipation Loss Caused by ESR (TECHNICAL NOTE)
In design of a parallel resonant induction heating system, choosing a proper capacitance for the resonant circuit is quite important. The capacitance affects the resonant frequency, output power, Q-factor, heating efficiency and power factor. In this paper, the role of equivalent series resistance (ESR) in the choice of capacitance is significantly recognized. Optimal value of resonance capacitor i...
Potential-based Shaping in Model-based Reinforcement Learning
Potential-based shaping was designed as a way of introducing background knowledge into model-free reinforcement-learning algorithms. By identifying states that are likely to have high value, this approach can decrease experience complexity—the number of trials needed to find near-optimal behavior. An orthogonal way of decreasing experience complexity is to use a model-based learning approach, b...
Journal: J. Artif. Intell. Res.
Volume: 19
Pages: -
Publication year: 2003